Name: Cindy Do, Faris Makarim, Ghipsel Cibrian Lopez

library(gapminder)
library(ggplot2)
#Start by playing around with the gapminder data a little more. You can try each of these explorations with geom_point() and then with geom_smooth(), or both together. What happens when you put the geom_smooth() function before geom_point() instead of after it? What does this tell you about how the plot is drawn? Think about how this might be useful when drawing plots.

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, color=continent))+
  geom_point()

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, color=continent))+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, color=continent))+
  geom_smooth()+
  geom_point()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#when smooth is written before point, the smooth line is on the backside of the point. Hence, we are creating layer by layer in accordance to the sequence we write in the code

#Change the mappings in the aes() function so that you plot Life Expectancy against population (pop) rather than per capita GDP. What does that look like? What does it tell you about the unit of observation in the dataset?

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#it looks like the scatter is very concentrated on low pop samples but we also have outliers with very high pop. It looks like we have an inappropriate scaling

#Try some alternative scale mappings. Besides scale_x_log10() you can try scale_x_sqrt() and scale_x_reverse(). There are corresponding functions for y-axis transformations. Just write y instead of x. Experiment with them to see what sort of effect they have on the plot, and whether they make any sense to use.

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()+
  scale_x_sqrt()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()+
  scale_x_reverse()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()+
  scale_y_log10()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()+
  scale_y_sqrt()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(gapminder, aes(x=pop, y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth()+
  scale_y_reverse()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#the scale_x_log10() makes the most sense as we saw that the observations are concentrated in the very low x. So, changing the x axis to log scale is the most appropriate

#What happens if you map color to year instead of continent? Is the result what you expected? Think about what class of object year is. Remember you can get a quick look at the top of the data, which includes some shorthand information on the class of each variable, by typing gapminder.

ggplot(gapminder, aes(x=pop, y=lifeExp, color=year))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

#the points in the scatterplot are colored by the gradient of year. This happens because year is a integer that spans from 1952 to 2007

#Instead of mapping color = year, what happens if you try color = factor(year)?
ggplot(gapminder, aes(x=pop, y=lifeExp, color=factor(year)))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#now we are treating year like a categorical variable instead of continuous like the previous one

#As you look at these different scatterplots, think about Figure 3.13 a little more critically. We worked it up to the point where it was reasonably polished, but is it really the best way to display this country-year data? What are we gaining and losing by ignoring the temporal and country-level structure of the data? How could we do better? Sketch out what an alternative visualization might look like

#we gain the overall trend of lifeexp vs gdppercapita, but then we can't see the difference between country or continent. We can create another one with color-code-ing country and continent

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp, color=continent))

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se=F) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
  title = "Economic Growth and Life Expectancy",
  subtitle = "Data points are country-years",
  caption = "Source: Gapminder.")
## `geom_smooth()` using formula = 'y ~ x'

#Revisit the gapminder plots at the beginning of the chapter and experiment with different ways to facet the data. Try plotting population and per capita GDP while faceting on year, or even on country. In the latter case you will get a lot of panels, and plotting them straight to the screen may take a long time. Instead, assign the plot to an object and save it as a PDF file to your figures/ folder. Experiment with the height and width of the figure.
library(socviz)

ggplot(gapminder, aes(x=pop, y=gdpPercap))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()+
  facet_wrap(~year)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

p<-ggplot(gapminder, aes(x=pop, y=gdpPercap))+
  geom_point()+
  geom_smooth()+
  scale_x_log10()+
  facet_wrap(~country)
ggsave("my_figure.pdf", plot = p)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Investigate the difference between a formula written as facet_grid(sex ~ race) versus one written as facet_grid(~ sex + race).

ggplot(gss_sm, aes(x=age, y=childs))+
  geom_point()+
  geom_smooth()+
  facet_grid(sex~race)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(gss_sm, aes(x=age, y=childs))+
  geom_point()+
  geom_smooth()+
  facet_grid(~sex+race)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).

#in the first one, the sex facets are placed in the vertical axis while the race is in the horizontal. In the second one, all sex and race are placed in the horizontal axis of the facets

#Experiment to see what happens when you use facet_wrap() with more complex forumulas like facet_wrap(~ sex + race) instead of facet_grid. Like facet_grid(), the facet_wrap() function can facet on two or more variables at once. But it will do it by laying the results out in a wrapped one-dimensional table instead of a fully crossclassified grid.

ggplot(gss_sm, aes(x=age, y=childs))+
  geom_point()+
  geom_smooth()+
  facet_wrap(~sex+race)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Removed 18 rows containing missing values or values outside the scale range
## (`geom_point()`).

#it is placing sex and race in the horizontal axis, but wrap will make the facet organized into 2 rows instead of 1 in the grid.

#Frequency polygons are closely related to histograms. Instead of displaying the count of observations using bars, they display it with a series of connected lines instead. You can try the various geom_histogram() calls in this chapter using geom_freqpoly() instead.

ggplot(midwest, aes(x=area))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(midwest, aes(x=area))+
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#A histogram bins observations for one variable and shows a bars with the count in each bin. We can do this for two variables at once, too. The geom_bin2d() function takes two mappings, x and y. It divides your plot into a grid and colors the bins by the count of observations in them. Try using it on the gapminder data to plot life expectancy versus per capita GDP. Like a histogram, you can vary the number or width of the bins for both x or y. Instead of saying bins = 30 or binwidth = 1, provide a number for both x and y with, for example, bins = c(20, 50). If you specify bindwith instead, you will need to pick values that are on the same scale as the variable you are mapping.
ggplot(gapminder, aes(x=lifeExp, y=gdpPercap))+
  geom_bin2d(bins=c(20,50))

#Density estimates can also be drawn in two dimensions. The geom_density_2d() function draws contour lines estimating the joint distribution of two variables. Try it with the midwest data, for example, plotting percent below the poverty line (percbelowpoverty) against percent college-educated (percollege). Try it with and without a geom_point() layer.
ggplot(midwest,aes(x=percbelowpoverty,y=percollege))+
  geom_density_2d()

ggplot(midwest,aes(x=percbelowpoverty,y=percollege))+
  geom_density_2d()+
  geom_point()

#We covered several new functions and data aggregation techniques in this Chapter. You should practice working with them.

#The subset() function is very useful when used in conjunction with a series of layered geoms. Go back to your code for the Presidential Elections plot (Figure 5.18) and redo it so that it shows all the data points but only labels elections since 1992. You might need to look again at the elections_historic data to see what variables are available to you. You can also experiment with subsetting by political party, or changing the colors of the points to reflect the winning party.

library(socviz)
library(ggrepel)

p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct))

p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point() +
    geom_text_repel(data=subset(elections_historic, year>=1992), mapping=aes(label = winner_label)) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
         caption = p_caption)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point() +
    geom_text_repel(data=subset(elections_historic, win_party=="Dem."), mapping=aes(label = winner_label)) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
         caption = p_caption)

p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point(aes(color=win_party)) +
    geom_text_repel(elections_historic, mapping=aes(label = winner_label)) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
         caption = p_caption)
## Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#Use geom_point() and reorder() to make a Cleveland dot plot of all Presidential elections, ordered by share of the popular vote.

p <- ggplot(elections_historic,
            aes(x = popular_pct, y = reorder(year, popular_pct)))
p + geom_point(size=3) +
    labs(x = "Share of popular vote",
         y = "Year") +
    theme(legend.position="top")

#Try using annotate() to add a rectangle that lightly colors the entire upper left quadrant of Figure 5.18.

p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point() +
    geom_text_repel(elections_historic, mapping=aes(label = winner_label)) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
         caption = p_caption)+
    annotate(geom = "rect", xmin = 0.3, xmax = 0.5,
             ymin = 0.5, ymax = 1, fill = "red", alpha = 0.2)
## Warning: ggrepel: 15 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#The main action verbs in the dplyr library are group_by(), filter(), select(), summarize(), and mutate(). Practice with them by revisiting the gapminder data to see if you can reproduce a pair of graphs from Chapter One, shown here again in Figure 5.28. You will need to filter some rows, group the data by continent, and calculate the mean life expectancy by continent before beginning the plotting process.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gapminder %>%
  filter(year==2007)%>%
  select(country, continent, lifeExp)%>%
  group_by(continent)%>%
  summarize(mean=mean(lifeExp))%>%
              ggplot(aes(x=mean,y=reorder(continent,mean)))+
              geom_point()

gapminder %>%
  filter(year==2007)%>%
  select(country, continent, lifeExp)%>%
  group_by(continent)%>%
  summarize(mean=mean(lifeExp))%>%
              ggplot()+
              geom_col(aes(x=mean,y=reorder(continent,mean)))

#Get comfortable with grouping, mutating, and summarizing data in pipelines. This will become a routine task as you work with your data. There are many ways that tables can be aggregated and transformed. Remember group_by() groups your data from left to right, with the rightmost or innermost group being the level calculations will be done at; mutate() adds a column at the current level of grouping; and summarize() aggregates to the next level up. Try creating some grouped objects from the GSS data, calculating frequencies as you learned in this Chapter, and then check to see if the totals are what you expect. For example, start by grouping degree by race, like this:
  
  gss_sm %>% group_by(race, degree) %>%
    summarize(N = n()) %>%
    mutate(pct = round(N / sum(N)*100, 0)) 
## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.
## # A tibble: 18 × 4
## # Groups:   race [3]
##    race  degree             N   pct
##    <fct> <fct>          <int> <dbl>
##  1 White Lt High School   197     9
##  2 White High School     1057    50
##  3 White Junior College   166     8
##  4 White Bachelor         426    20
##  5 White Graduate         250    12
##  6 White <NA>               4     0
##  7 Black Lt High School    60    12
##  8 Black High School      292    60
##  9 Black Junior College    33     7
## 10 Black Bachelor          71    14
## 11 Black Graduate          31     6
## 12 Black <NA>               3     1
## 13 Other Lt High School    71    26
## 14 Other High School      112    40
## 15 Other Junior College    17     6
## 16 Other Bachelor          39    14
## 17 Other Graduate          37    13
## 18 Other <NA>               1     0
## # A tibble: 18 x 4
## # Groups:   race [3]
##    race  degree             N   pct
##    <fct> <fct>          <int> <dbl>
##  1 White Lt High School   197  9.00
##  2 White High School     1057 50.0 
##  3 White Junior College   166  8.00
##  4 White Bachelor         426 20.0 
##  5 White Graduate         250 12.0 
##  6 White <NA>               4  0   
##  7 Black Lt High School    60 12.0 
##  8 Black High School      292 60.0 
##  9 Black Junior College    33  7.00
## 10 Black Bachelor          71 14.0 
## 11 Black Graduate          31  6.00
## 12 Black <NA>               3  1.00
## 13 Other Lt High School    71 26.0 
## 14 Other High School      112 40.0 
## 15 Other Junior College    17  6.00
## 16 Other Bachelor          39 14.0 
## 17 Other Graduate          37 13.0 
## 18 Other <NA>               1  0

#This code is similar to what you saw earlier, but a little more compact. (We calculate the pct values directly.) Check the results are as you expect by grouping by race and summing the percentages. Try doing the same exercise grouping by sex or region.
  gss_sm %>% group_by(race)%>%
    summarize(N=n())%>%
    mutate(pct=round(N/sum(N)*100,0))
## # A tibble: 3 × 3
##   race      N   pct
##   <fct> <int> <dbl>
## 1 White  2100    73
## 2 Black   490    17
## 3 Other   277    10
  gss_sm %>% group_by(sex) %>%
    summarize(N=n())%>%
    mutate(pct=round(N/sum(N)*100,0))
## # A tibble: 2 × 3
##   sex        N   pct
##   <fct>  <int> <dbl>
## 1 Male    1276    45
## 2 Female  1591    55
    gss_sm %>% group_by(region) %>%
    summarize(N=n())%>%
    mutate(pct=round(N/sum(N)*100,0))
## # A tibble: 9 × 3
##   region              N   pct
##   <fct>           <int> <dbl>
## 1 New England       175     6
## 2 Middle Atlantic   313    11
## 3 E. Nor. Central   502    18
## 4 W. Nor. Central   193     7
## 5 South Atlantic    550    19
## 6 E. Sou. Central   205     7
## 7 W. Sou. Central   297    10
## 8 Mountain          235     8
## 9 Pacific           397    14
#Try summary calculations with functions other than sum. Can you calculate the mean and median number of children by degree? (Hint: the childs variable in gss_sm has children as a numeric value.)
    
    gss_sm%>% group_by(degree)%>%
      summarize(mean=mean(childs, na.rm=T),median=median(childs,na.rm=T))
## # A tibble: 6 × 3
##   degree          mean median
##   <fct>          <dbl>  <dbl>
## 1 Lt High School  2.81      3
## 2 High School     1.86      2
## 3 Junior College  1.77      2
## 4 Bachelor        1.45      1
## 5 Graduate        1.52      2
## 6 <NA>            3.6       4
#dplyr has a large number of helper functions that let you summarize data in many different ways. The vignette on window functions included with the dplyr documentation is a good place to begin learning about these. You should also look at Chapter 3 of Wickham & Grolemund (2016) for more information on transforming data with dplyr.
    
#Experiment with the gapminder data to practice some of the new geoms we have learned. Try examining population or life expectancy over time using a series of boxplots. (Hint: you may need to use the group aesthetic in the aes() call.) Can you facet this boxplot by continent? Is anything different if you create a tibble from gapminder that explicitly groups the data by year and continent first, and then create your plots with that?
    
  ggplot(gapminder,aes(x=year,y=lifeExp,group=year))+
    geom_boxplot()

    ggplot(gapminder,aes(x=year,y=lifeExp,group=year))+
    geom_boxplot()+
      facet_wrap(~continent)

    gapminder %>%
      group_by(year,continent)%>%
      ggplot(aes(x=year,y=lifeExp,group=year))+
      geom_boxplot()

#no, nothing different

#Read the help page for geom_boxplot() and take a look at the notch and varwidth options. Try them out to see how they change the look of the plot.
?geom_boxplot
## starting httpd help server ...
##  done
    ggplot(gapminder,aes(x=year,y=lifeExp,group=year))+
    geom_boxplot(notch=T)+
      facet_wrap(~continent)
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?

    ggplot(gapminder,aes(x=year,y=lifeExp,group=year))+
    geom_boxplot(notch=T, varwidth=T)+
    facet_wrap(~continent)
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?
## Notch went outside hinges
## ℹ Do you want `notch = FALSE`?

#with varwidth=T, boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.
#with notch=T, there is a notch in the median line

#As an alternative to geom_boxplot() try geom_violin() for a similar plot, but with a mirrored density distribution instead of a box and whiskers.
    
    ggplot(gapminder,aes(x=year,y=lifeExp,group=year))+
    geom_violin()

#geom_pointrange() is one of a family of related geoms that produce different kinds of error bars and ranges, depending on your specific needs. They include geom_linerange(), geom_crossbar(), and geom_errorbar(). Try them out using gapminder or organdata to see how they differ.

    gapminder%>%
    group_by(year)%>%
    mutate(ymin=min(lifeExp),ymax=max(lifeExp))%>%
    ggplot(aes(x=year,y=lifeExp,group=year))+
    geom_pointrange(aes(ymin=ymin, ymax=ymax))

    gapminder%>%
    group_by(year)%>%
    mutate(ymin=min(lifeExp),ymax=max(lifeExp))%>%
    ggplot(aes(x=year,y=lifeExp,group=year))+
    geom_linerange(aes(ymin=ymin, ymax=ymax))

    gapminder%>%
    group_by(year)%>%
    mutate(ymin=min(lifeExp),ymax=max(lifeExp))%>%
    ggplot(aes(x=year,y=lifeExp,group=year))+
    geom_crossbar(aes(ymin=ymin, ymax=ymax))

    gapminder%>%
    group_by(year)%>%
    mutate(ymin=min(lifeExp),ymax=max(lifeExp))%>%
    ggplot(aes(x=year,y=lifeExp,group=year))+
    geom_errorbar(aes(ymin=ymin, ymax=ymax))

#In general, when you estimate models and want to plot the results, the difficult step is not the plotting but rather calculating and extracting the right numbers. Generating predicted values and measures of confidence or uncertainty from models requires that you understand the model you are fitting, and the function you use to fit it, especially when it involves interactions, cross-level effects, or transformations of the predictor or response scales. The details can vary substantially from model type to model type, and also with the goals of any particular analysis. It is unwise to approach them mechanically. That said, several tools exist to help you work with model objects and produce a default set of plots from them.

#6.9.1 Default plots for models
#Just as model objects in R usually have a default summary() method, printing out an overview tailored to the type of model it is, they will usually have a default plot() method, too. Figures produced by plot() are typically not generated via ggplot, but it is usually worth exploring them. They typically make use of either R’s base graphics or the lattice library (Sarkar, 2008). These are two plotting systems that we do not cover in this book. Default plot methods are easy to examine. Let’s take a look again at our simple OLS model.

out <- lm(formula = lifeExp ~ log(gdpPercap) + pop + continent, data = gapminder)
#To look at some of R’s default plots for this model, use the plot() function.

# Plot not shown
plot(out, which = c(1,2), ask=FALSE)

#The which() statement here selects the first two of four default plots for this kind of model. If you want to easily reproduce base R’s default model graphics using ggplot, the ggfortify library is worth examining. It is in some ways similar to broom, in that it tidies the output of model objects, but it focuses on producing a standard plot (or group of plots) for a wide variety of model types. It does this by defining a function called autoplot(). The idea is to be able to use autoplot() with the output of many different kinds of model.

#A second option worth looking at is the coefplot library. It provides a quick way to produce good-quality plots of point estimates and confidence intervals. It has the advantage of managing the estimation of interaction effects and other occasionally tricky calculations.

library(coefplot)
out <- lm(formula = lifeExp ~ log(gdpPercap) + log(pop) + continent, data = gapminder)

coefplot(out, sort = "magnitude", intercept = FALSE)

#6.9.2 Tools in development
#Tidyverse tools for modeling and model exploration are being actively developed. The broom and margins libraries continue to get more and more useful. There are also other projects worth paying attention to. The inferpackage is in its early stages but can already do useful things in a pipeline-friendly way. You can install it from CRAN with install.packages("infer").

#6.9.3 Extensions to ggplot
#The GGally package provides a suite of functions designed to make producing standard but somewhat complex plots a little easier. For instance, it can produce generalized pairs plots, a useful way of quickly examining possible relationships between several different variables at once. This sort of plot is like the visual version of a correlation matrix. It shows a bivariate plot for all pairs of variables in the data. This is relatively straightforward when all the variables are continuous measures. Things get more complex when, as is often the case in the social sciences, some or all variables are categorical or otherwise limited in the range of values they can take. A generalized pairs plot can handle these cases. For example, Figure ?? shows a generalized pairs plot for five variables from the organdata dataset.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
organdata_sm <- organdata %>%
    select(donors, pop_dens, pubhealth,
           roads, consent_law)

ggpairs(data = organdata_sm,
        mapping = aes(color = consent_law),
        upper = list(continuous = wrap("density"), combo = "box_no_facet"),
        lower = list(continuous = wrap("points"), combo = wrap("dot_no_facet")))
## Warning: Removed 34 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 34 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 34 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 34 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 21 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 21 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 21 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 21 rows containing non-finite outside the scale range
## (`stat_density2d()`).
## Warning: Removed 21 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 21 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 21 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).

#Multi-panel plots like this are intrinsically very rich in information. When combined with several within-panel types of representation, or any more than a modest number of variables, they can become quite complex. They should be used less for the presentation of finished work, although it is possible. More often they are a useful tool for the working researcher to quickly investigate aspects of a data set. The goal is not to pithily summarize a single point one already knows, but to open things up for further exploration.
library("ggplot2")
library("GGally")
library("ggExtra")
library("ggalluvial")
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("data.table")
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
pisa<-fread("C:\\Users\\faris\\OneDrive\\Documents\\ECON\\Applied Data\\pisa2015\\pisa2015.csv")

country <- c("United States", "Canada", "Mexico", "B-S-J-G (China)", "Japan",
             "Korea", "Germany", "Italy", "France", "Brazil", "Colombia", "Uruguay",
             "Australia", "New Zealand", "Jordan", "Israel", "Lebanon")

bin.to.num <- function(x){
  if (is.na(x)) NA
  else if (x == "Yes") 1L
  else if (x == "No") 0L
}

pisa[, `:=` 
     (female = ifelse(ST004D01T == "Female", 1, 0),
       sex = ST004D01T,
       
       # At my house we have ...
       desk = sapply(ST011Q01TA, bin.to.num),
       own.room = sapply(ST011Q02TA, bin.to.num),
       quiet.study = sapply(ST011Q03TA, bin.to.num),
       computer = sapply(ST011Q04TA, bin.to.num),
       software = sapply(ST011Q05TA, bin.to.num),
       internet = sapply(ST011Q06TA, bin.to.num),
       lit = sapply(ST011Q07TA, bin.to.num),
       poetry = sapply(ST011Q08TA, bin.to.num),
       art = sapply(ST011Q09TA, bin.to.num),
       book.sch = sapply(ST011Q10TA, bin.to.num),
       tech.book = sapply(ST011Q11TA, bin.to.num),
       dict = sapply(ST011Q12TA, bin.to.num),
       art.book = sapply(ST011Q16NA, bin.to.num))]

pisa[, `:=`
     (math = rowMeans(pisa[, c(paste0("PV", 1:10, "MATH"))], na.rm = TRUE),
       reading = rowMeans(pisa[, c(paste0("PV", 1:10, "READ"))], na.rm = TRUE),
       science = rowMeans(pisa[, c(paste0("PV", 1:10, "SCIE"))], na.rm = TRUE))]

dat <- pisa[CNT %in% country,
            .(CNT, OECD, CNTSTUID, W_FSTUWT, sex, female,
              ST001D01T, computer, software, internet,
              ST011Q05TA, ST071Q02NA, ST071Q01NA, ST123Q02NA,
              ST082Q01NA, ST119Q01NA, ST119Q05NA, ANXTEST,
              COOPERATE, BELONG,  EMOSUPS, HOMESCH, ENTUSE,
              ICTHOME, ICTSCH, WEALTH, PARED, TMINS, ESCS,
              TEACHSUP, TDTEACH, IBTEACH, SCIEEFF,
              math, reading, science)
            ]

dat <- dat[, `:=` (
  # New grade variable
  grade = (as.numeric(sapply(ST001D01T, function(x) {
  if(x=="Grade 7") "7"
  else if (x=="Grade 8") "8"
  else if (x=="Grade 9") "9"
  else if (x=="Grade 10") "10"
  else if (x=="Grade 11") "11"
  else if (x=="Grade 12") "12"
  else if (x=="Grade 13") NA_character_
  else if (x=="Ungraded") NA_character_}))),
  # Total learning time as hours
  learning = round(TMINS/60, 0),
  # Regions for selected countries
  Region = (sapply(CNT, function(x) {
  if(x %in% c("Canada", "United States", "Mexico")) "N. America"
    else if (x %in% c("Colombia", "Brazil", "Uruguay")) "S. America"
    else if (x %in% c("Japan", "B-S-J-G (China)", "Korea")) "Asia"
    else if (x %in% c("Germany", "Italy", "France")) "Europe"
    else if (x %in% c("Australia", "New Zealand")) "Australia"
    else if (x %in% c("Israel", "Jordan", "Lebanon")) "Middle-East"
    }))
  )]
#Create a plot of math scores over countries with different colors based on region. You need to modify the R code below by replacing geom_boxplot with:

#geom_point(aes(color = Region)), and then
#geom_violin(aes(color = Region)).

ggplot(data = dat,
       mapping = aes(x = reorder(CNT, math), y = math, fill = Region)) +
  geom_boxplot() +
  labs(x=NULL, y="Math Scores") +
  coord_flip() +
  geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
  theme_bw()

ggplot(data = dat,
       mapping = aes(x = reorder(CNT, math), y = math, color = Region)) +
  geom_violin() +
  labs(x=NULL, y="Math Scores") +
  coord_flip() +
  geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
  theme_bw()

ggplot(data = dat,
       mapping = aes(x = reorder(CNT, math), y = math, color = Region)) +
  geom_point() +
  labs(x=NULL, y="Math Scores") +
  coord_flip() +
  geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
  theme_bw()

#geom point takes the longest time to load. However, geom_boxplot is sill the easiest plot to read than point and violin.

#We can also create histograms (or density plots) for a particular variable and split the plot into multiple plots by using a categorical, group variable. In the following example, we use x = Region in the mapping in order to identify different regions in the distribution of the science scores. In addition, we use facet_grid(. ~ sex) to generate separate histograms by gender. Note that we also added title = "Science Scores by Gender and Region" as a title in the labs function.

ggplot(dat, aes(x=science, fill=Region))+
  geom_histogram(alpha=0.5, bins=50)+
  facet_wrap(~sex)+
  labs(x="Science Scores", y="Count")+
  theme_bw()

#If we are interested in visualizing multiple variables, plotting each variable individually can be time consuming. Therefore, we can use the ggpairs function from the GGally package to build a more complex, diagnostic plot for multiple variables.
#In the following example, we plot reading, science, and math scores as well as gender (i.e., sex) in the same plot. Because our dataset is quite large, plotting all the data points would result in a highly complex plot where most data points would overlap on each other. Therefore, we will take a random sample of 500 cases from each region defined in the data, save this smaller dataset as dat_small, and use this dataset inside the ggpairs function. We colorize each variable by region (using mapping = aes(color = Region)). The resulting plot shows density plots for the continuous variables (by region), a stacked bar chart for gender, and box plots for the continuous variables by region and gender.

# Random sample of 500 students from each region
dat_small <- dat[,.SD[sample(.N, min(500,.N))], by = Region]

ggpairs(data = dat_small,
        mapping = aes(color = Region),
        columns = c("reading", "science", "math", "sex"),
        upper = list(continuous = wrap("cor", size = 2.5))
        )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Interpretation:
#What can we say about the regions based on the plots above?
#Students in Asia have the highest PISA scores on average, followed by Australia, Europe, North America, Middle East, and South America

#Do you see any major gender differences for reading, science, or math?
#No, there is no major gender differences for all subjects overall

#What is the relationship among reading, science, or math?
#The correlation among the subjects are very high, especially for science-reading and math-science. However, the math-reading correlation is smaller in comparison, but still very high in absolute terms.
#Create a scatterplot of socio-economic status (ESCS) and math scores (math) across regions (region) and gender (sex). Use geom_smooth(method = "lm") to draw linear regression lines (instead of loess smoothing). Do you think that the relationship between ESCS and math changes across gender and regions?

ggplot(dat,aes(x=ESCS, y=math))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_grid(sex~Region)+
  theme_bw() +
  theme(legend.title = element_blank())
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 5053 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 5053 rows containing missing values or values outside the scale range
## (`geom_point()`).

#the relationship between ESCS and math does not change accross gender and regions.
#Create an alluvial plot for the survey item (ST119Q01NA) of whether students want top grades in most or all courses by region (Region) and gender (sex). Below we create the summary dataset (dat_alluvial2) for this plot. Use this dataset to draw the alluvial plot plot. How should we interpret the plot (e.g., for each region)?
dat_alluvial2 <- dat[,
                     .(Freq = .N),
                     by = c("Region", "sex", "ST119Q01NA")
                     ][,
                       ST119Q01NA := as.factor(ifelse(ST119Q01NA == "", "Missing", ST119Q01NA))]
dat_alluvial2$ST119Q01NA
##  [1] Agree             Strongly agree    Agree             Disagree         
##  [5] Strongly agree    Strongly disagree Missing           Missing          
##  [9] Disagree          Strongly disagree Missing           Agree            
## [13] Missing           Strongly agree    Disagree          Strongly agree   
## [17] Agree             Disagree          Strongly disagree Strongly disagree
## [21] Strongly agree    Agree             Disagree          Strongly agree   
## [25] Agree             Disagree          Missing           Strongly disagree
## [29] Strongly disagree Missing           Strongly agree    Agree            
## [33] Disagree          Agree             Missing           Strongly disagree
## [37] Strongly agree    Disagree          Missing           Strongly disagree
## [41] Strongly agree    Agree             Strongly agree    Agree            
## [45] Strongly disagree Strongly disagree Missing           Disagree         
## [49] Missing           Disagree          Agree             Agree            
## [53] Disagree          Strongly agree    Strongly agree    Disagree         
## [57] Strongly disagree Strongly disagree Missing           Missing          
## Levels: Agree Disagree Missing Strongly agree Strongly disagree
levels(dat_alluvial2$ST119Q01NA) <- c("Strongly agree", "Agree", "Disagree",
                                      "Strongly disagree", "Missing")

ggplot(data=dat_alluvial2, aes(axis1=Region, axis2=ST119Q01NA, y=Freq))+
  scale_x_discrete(limits=c("Region", "Students want top grades"), expand=c(.1,.05))+
  geom_alluvium(aes(fill=sex))+
  geom_stratum()+
  geom_text(stat = "stratum", aes(label=after_stat(stratum))) +
  labs(x = "Demographics", y = "Frequency", fill = "Gender") +
  theme_bw()

#For all regions, majority of students have either strong disagreement or strong agreement about wanting top grades, except for Middle East, in which the majority of them chooses "Agree". In all other countries, the students who choose strongly disagree are somewhat more than who choose strongly agree.
#Replicate the science-by-region histogram below as a density plot and useplotly to make it interactive. You will need to replace geom_histogram(alpha = 0.5, bins = 50) with geom_density(alpha = 0.5). Repeat the same process by changing alpha = 0.5 to alpha = 0.8. Which version is better for examining the science score distribution?

p6 <- ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_histogram(alpha = 0.5, bins = 50) +
  labs(x = "Science Scores", y = "Count",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

ggplotly(p6)
p7<- ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_density(alpha = 0.5) +
  labs(x = "Science Scores", y = "Count",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

ggplotly(p7)
p8<- ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_density(alpha = 0.8) +
  labs(x = "Science Scores", y = "Count",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

ggplotly(p8)
#The geom_density looks better in observing the distribution of science score than the original histogram as we can see the count of every score better. The alpha=0.5 is also better because the density are overlapping, hence we can't see the difference between regions clearly with alpha=0.8
#We want to examine the relationships between reading scores and technology-related variables in the dat dataset that we created earlier. Create at least two visualizations (either static or interactive) using some of the variables shown below:

#Region
#sex
#grade
#HOMESCH
#ENTUSE
#ICTHOME
#ICTSCH
#You can focus on a particular country or region or use the entire dataset for your visualizations.

ggplot(dat,aes(x=ICTHOME,y=reading,color=sex))+
  geom_smooth(method = "loess",se=TRUE)+
  theme_bw()+
  labs(x="ICT Possession at Home", y="Reading Scores")+
  facet_wrap(~Region)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 67483 rows containing non-finite outside the scale range
## (`stat_smooth()`).

ggplot(dat,aes(x=ICTHOME,y=ICTSCH,color=reading>550))+
  geom_jitter()+
  theme_minimal()+
  labs(x="Home ICT Availability", y="School ICT Availability", title="High-Achievers have more ICT at Homes, but not necessarily at Schools")
## Warning: Removed 75492 rows containing missing values or values outside the scale range
## (`geom_point()`).

```{}